04-Data Transformation

Let’s start by loading the dplyr package:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Did you notice the messages printed when the package was loaded? What’s going on there?

It turns out that the dplyr package has a function named filter(), but the stats package, which is automatically loaded when you start an R session, also has a function named filter()! So, if I type the command filter(dataset, ...), how does R know which filter() function to use?

R looks for the function filter() starting with the package that was loaded most recently, and going backwards in time. Since dplyr was the last package loaded, R will assume that we meant dplyr’s version of filter() and use that.

What if I meant the stats version of filter() instead? Is there a way that I can reference it? Yes! We can use “double colon” notation: stats::filter(). (The general syntax for this is packageName::functionName().)
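Here is a minimal sketch of the double colon notation in action. The vector x and the moving-average weights are made up for illustration; search() is a base R function that lists the attached packages in the order in which R consults them.

```r
# search() lists what is attached; R looks up names starting from the front
search()

# Explicitly call the stats version of filter(), no matter what else is loaded.
# stats::filter() applies a linear filter; these weights give a centered
# 3-point moving average (the two endpoints have no full window, so they are NA).
x <- c(1, 2, 3, 4, 5)                       # made-up example data
moving_avg <- stats::filter(x, rep(1 / 3, 3))
as.numeric(moving_avg)
```

In the same way, dplyr::filter() would pick out dplyr’s version explicitly, even if another package masked it later.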

nycflights13

Today we’ll be working with the flights dataset from the nycflights13 package. Let’s load the nycflights13 package and the flights dataset (use install.packages("nycflights13") if you don’t have the package yet):

library(nycflights13)
data(flights)

Next, use the ?, str() and View() functions to examine the dataset:

?flights
str(flights)
View(flights)

This dataset contains ~336,000 flights that departed from New York City (all 3 airports) in 2013.

Next, just key in the dataset name (i.e. flights):

flights

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Did you notice that the output format is different from what we’ve seen before? That’s because previous datasets were in a data structure that we called data frames, while this is in a data structure called a tibble. Don’t worry about the difference: for our purposes, a tibble behaves like a data frame that prints more compactly.
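If you want to convince yourself that a tibble is still a data frame underneath, here is a small sketch. The data frame df is made up for illustration; as_tibble() is available once dplyr is loaded.

```r
library(dplyr)

# A small made-up data frame, converted to a tibble
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

is.data.frame(df)   # TRUE
is.data.frame(tb)   # TRUE: a tibble is still a data frame
class(tb)           # includes both "tbl_df" and "data.frame"
```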

filter() and logical operations

Since we are here at Stanford, we may only be interested in flights from NYC to SFO. We can use the filter() verb to achieve this:

flights %>% filter(dest == "SFO")

## # A tibble: 13,331 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      558            600        -2      923
##  2  2013     1     1      611            600        11      945
##  3  2013     1     1      655            700        -5     1037
##  4  2013     1     1      729            730        -1     1049
##  5  2013     1     1      734            737        -3     1047
##  6  2013     1     1      745            745         0     1135
##  7  2013     1     1      746            746         0     1119
##  8  2013     1     1      803            800         3     1132
##  9  2013     1     1      826            817         9     1145
## 10  2013     1     1     1029           1030        -1     1427
## # ... with 13,321 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Note that we used == to test whether dest was equal to "SFO". DO NOT USE =. In programming, = usually means variable assignment.
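To see what == is doing behind the scenes, here is a sketch on a tiny made-up dataset (toy_flights and its values are invented for illustration). The comparison produces one TRUE/FALSE per row, and filter() keeps the rows where it is TRUE.

```r
library(dplyr)

# Made-up mini dataset for illustration
toy_flights <- tibble(
    flight = c(11, 22, 33, 44),
    dest   = c("SFO", "LAX", "SFO", "OAK")
)

# == compares element by element and returns a logical vector
toy_flights$dest == "SFO"

# filter() keeps only the rows where the condition is TRUE
to_sfo <- toy_flights %>% filter(dest == "SFO")
to_sfo
```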

There are two other international airports near Stanford, San Jose International Airport (“SJC”) and Oakland International Airport (“OAK”). So if we want to analyze flights that people take to get from NYC to Stanford, we should probably include these flights.

flights %>% filter(dest == "SFO" | dest == "SJC" | dest == "OAK")

## # A tibble: 13,972 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      558            600        -2      923
##  2  2013     1     1      611            600        11      945
##  3  2013     1     1      655            700        -5     1037
##  4  2013     1     1      729            730        -1     1049
##  5  2013     1     1      734            737        -3     1047
##  6  2013     1     1      745            745         0     1135
##  7  2013     1     1      746            746         0     1119
##  8  2013     1     1      803            800         3     1132
##  9  2013     1     1      826            817         9     1145
## 10  2013     1     1     1029           1030        -1     1427
## # ... with 13,962 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

The command above filters the dataset and prints it out, but does not retain the output. To keep the extracted dataset for further analysis, we have to assign it to a variable:

Stanford <- flights %>% filter(dest == "SFO" | dest == "SJC" | dest == "OAK")

We now have flights from NYC to SFO/SJC/OAK for the entire year. Let’s say that I’m only interested in flights when school is in session (Sep - Jun). Since month is a numeric variable, we could do this:

Stanford %>% filter(month <= 6 | month >= 9)

## # A tibble: 11,351 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      558            600        -2      923
##  2  2013     1     1      611            600        11      945
##  3  2013     1     1      655            700        -5     1037
##  4  2013     1     1      729            730        -1     1049
##  5  2013     1     1      734            737        -3     1047
##  6  2013     1     1      745            745         0     1135
##  7  2013     1     1      746            746         0     1119
##  8  2013     1     1      803            800         3     1132
##  9  2013     1     1      826            817         9     1145
## 10  2013     1     1     1029           1030        -1     1427
## # ... with 11,341 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

or this:

Stanford %>% filter(month != 7 & month != 8)

## # A tibble: 11,351 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      558            600        -2      923
##  2  2013     1     1      611            600        11      945
##  3  2013     1     1      655            700        -5     1037
##  4  2013     1     1      729            730        -1     1049
##  5  2013     1     1      734            737        -3     1047
##  6  2013     1     1      745            745         0     1135
##  7  2013     1     1      746            746         0     1119
##  8  2013     1     1      803            800         3     1132
##  9  2013     1     1      826            817         9     1145
## 10  2013     1     1     1029           1030        -1     1427
## # ... with 11,341 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
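Why do the two filters above return the same rows? Because month only takes whole-number values from 1 to 12, "at most 6 or at least 9" is the same as "not 7 and not 8". The sketch below checks this in base R on a made-up vector of the 12 possible months, and adds a third equivalent form using ! (logical NOT).

```r
month <- 1:12   # every value that month can take

cond_or  <- month <= 6 | month >= 9      # first version
cond_and <- month != 7 & month != 8      # second version
cond_not <- !(month == 7 | month == 8)   # "not (July or August)"

identical(cond_or, cond_and)    # TRUE
identical(cond_and, cond_not)   # TRUE
month[cond_or]                  # the months that are kept
```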

select() & rename()

Let’s return to the Stanford dataset (i.e. all flights from NYC to SFO/SJC/OAK). Notice that we have a total of 19 variables. Sometimes our datasets will have hundreds or thousands of variables! Not all of them may be of interest to us. select() allows us to choose a subset of these variables to form a smaller dataset that may be easier to work with.

19 is a pretty small number so we could do our data analysis without dropping any columns, but let’s just try out some commands to get a feel for how select() works.

We can select columns by name: if we just want the year, month and day columns, we can use the following code:

Stanford %>% select(year, month, day)

## # A tibble: 13,972 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 13,962 more rows

If the columns we want form a contiguous block, then we can use simpler syntax. To select columns from year to arr_delay (inclusive):

Stanford %>% select(year:arr_delay)

## # A tibble: 13,972 x 9
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      558            600        -2      923
##  2  2013     1     1      611            600        11      945
##  3  2013     1     1      655            700        -5     1037
##  4  2013     1     1      729            730        -1     1049
##  5  2013     1     1      734            737        -3     1047
##  6  2013     1     1      745            745         0     1135
##  7  2013     1     1      746            746         0     1119
##  8  2013     1     1      803            800         3     1132
##  9  2013     1     1      826            817         9     1145
## 10  2013     1     1     1029           1030        -1     1427
## # ... with 13,962 more rows, and 2 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>

In this example, the year column is superfluous, since all the values are 2013. The code below drops the year column, keeping the rest:

Stanford %>% select(-year)

## # A tibble: 13,972 x 18
##    month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1     1     1      558            600        -2      923            937
##  2     1     1      611            600        11      945            931
##  3     1     1      655            700        -5     1037           1045
##  4     1     1      729            730        -1     1049           1115
##  5     1     1      734            737        -3     1047           1113
##  6     1     1      745            745         0     1135           1125
##  7     1     1      746            746         0     1119           1129
##  8     1     1      803            800         3     1132           1144
##  9     1     1      826            817         9     1145           1158
## 10     1     1     1029           1030        -1     1427           1355
## # ... with 13,962 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

select() can also be used to rearrange the columns. If, for example, I wanted to have the first 3 columns be day, month, year instead of year, month, day:

Stanford %>% select(day, month, year, everything())

## # A tibble: 13,972 x 19
##      day month  year dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1     1     1  2013      558            600        -2      923
##  2     1     1  2013      611            600        11      945
##  3     1     1  2013      655            700        -5     1037
##  4     1     1  2013      729            730        -1     1049
##  5     1     1  2013      734            737        -3     1047
##  6     1     1  2013      745            745         0     1135
##  7     1     1  2013      746            746         0     1119
##  8     1     1  2013      803            800         3     1132
##  9     1     1  2013      826            817         9     1145
## 10     1     1  2013     1029           1030        -1     1427
## # ... with 13,962 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

To rename columns, use the rename() function (the syntax is new_name = old_name):

Stanford %>% rename(tail_num = tailnum)

## # A tibble: 13,972 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      558            600        -2      923
##  2  2013     1     1      611            600        11      945
##  3  2013     1     1      655            700        -5     1037
##  4  2013     1     1      729            730        -1     1049
##  5  2013     1     1      734            737        -3     1047
##  6  2013     1     1      745            745         0     1135
##  7  2013     1     1      746            746         0     1119
##  8  2013     1     1      803            800         3     1132
##  9  2013     1     1      826            817         9     1145
## 10  2013     1     1     1029           1030        -1     1427
## # ... with 13,962 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tail_num <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
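The select() and rename() variations above are easier to compare side by side on a tiny dataset. In the sketch below, toy and its values are made up, and names() is used just to show which columns survive. The ends_with() helper and the renaming form of select() are not covered above; treat them as extras to try out.

```r
library(dplyr)

# Made-up two-row dataset with a few of the flights columns
toy <- tibble(
    year      = c(2013, 2013),
    month     = c(1, 2),
    day       = c(5, 6),
    dep_delay = c(-2, 11),
    arr_delay = c(4, 20),
    tailnum   = c("N123AB", "N456CD")   # invented tail numbers
)

toy %>% select(year:day) %>% names()               # contiguous block
toy %>% select(-year) %>% names()                  # everything except year
toy %>% select(ends_with("delay")) %>% names()     # columns whose names end in "delay"
toy %>% select(tailnum, everything()) %>% names()  # move tailnum to the front
toy %>% rename(tail_num = tailnum) %>% names()     # rename, keep all columns
toy %>% select(tail_num = tailnum) %>% names()     # rename, keep only that column
```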

arrange()

Often we get datasets which are not in order, or in an order which we are not interested in. The arrange() function allows us to reorder the rows according to an order we want.

The Stanford dataset looks like it is already ordered by actual departure time. Perhaps I’m most interested in the flights which had the longest departure delay. I could sort the dataset as follows:

Stanford %>% arrange(dep_delay)

## # A tibble: 13,972 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013    12    11      710            730       -20     1039
##  2  2013    11    16      712            730       -18     1025
##  3  2013     9    11      712            730       -18      946
##  4  2013    11    19      713            730       -17     1036
##  5  2013     7    14     1151           1208       -17     1450
##  6  2013    12    10      714            730       -16     1104
##  7  2013     3    29     1050           1106       -16     1359
##  8  2013     4    20     1420           1436       -16     1737
##  9  2013     5    20      719            735       -16      951
## 10  2013     1    23      545            600       -15      948
## # ... with 13,962 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Looks like the flights with the shortest delay are at the top instead! To sort in descending order, use desc():

Stanford %>% arrange(desc(dep_delay))

## # A tibble: 13,972 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     9    20     1139           1845      1014     1457
##  2  2013     7     7     2123           1030       653       17
##  3  2013     7     7     2059           1030       629      106
##  4  2013     7     6      149           1600       589      456
##  5  2013     7    10      133           1800       453      455
##  6  2013     7    10     2342           1630       432      312
##  7  2013     7     7     2204           1525       399      107
##  8  2013     7     7     2306           1630       396      250
##  9  2013     6    23     1833           1200       393       NA
## 10  2013     7    10     2232           1609       383      138
## # ... with 13,962 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

(Wow, that’s a really long delay! Almost 17 hours.) To extract just the flights with the top 10 departure delays, we can use the head() function:

Stanford %>% 
    arrange(desc(dep_delay)) %>%
    head(n = 10)

## # A tibble: 10 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     9    20     1139           1845      1014     1457
##  2  2013     7     7     2123           1030       653       17
##  3  2013     7     7     2059           1030       629      106
##  4  2013     7     6      149           1600       589      456
##  5  2013     7    10      133           1800       453      455
##  6  2013     7    10     2342           1630       432      312
##  7  2013     7     7     2204           1525       399      107
##  8  2013     7     7     2306           1630       396      250
##  9  2013     6    23     1833           1200       393       NA
## 10  2013     7    10     2232           1609       383      138
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>
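Here is the same arrange() / desc() / head() pattern on a dataset small enough to check by eye. The delays data is made up; it includes one NA so that we can see where missing values end up (as far as I know, arrange() puts them last in both directions).

```r
library(dplyr)

# Made-up delays, including one missing value
delays <- tibble(
    flight    = c("A", "B", "C", "D"),
    dep_delay = c(5, NA, -3, 40)
)

ascending  <- delays %>% arrange(dep_delay)
descending <- delays %>% arrange(desc(dep_delay))

ascending$flight    # "C" "A" "D" "B"  (the NA row is last)
descending$flight   # "D" "A" "C" "B"  (the NA row is still last)

# The 2 flights with the longest delays
top2 <- delays %>%
    arrange(desc(dep_delay)) %>%
    head(n = 2)
top2
```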

arrange() also allows us to sort by more than one column, in that each additional column will be used to break ties in the values of the preceding ones. For example, flights seems to be sorted by year, month, day, and actual departure time. If I wanted to sort by year, month, day and scheduled departure time instead:

Stanford %>% arrange(year, month, day, sched_dep_time)

## # A tibble: 13,972 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      558            600        -2      923
##  2  2013     1     1      611            600        11      945
##  3  2013     1     1      655            700        -5     1037
##  4  2013     1     1      729            730        -1     1049
##  5  2013     1     1      734            737        -3     1047
##  6  2013     1     1      745            745         0     1135
##  7  2013     1     1      746            746         0     1119
##  8  2013     1     1      803            800         3     1132
##  9  2013     1     1      826            817         9     1145
## 10  2013     1     1     1029           1030        -1     1427
## # ... with 13,962 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
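To see the tie-breaking at work, here is a sketch on a made-up schedule (sched and its values are invented). Rows are ordered by day first; sched_dep_time only matters among rows that share the same day. We can also mix ascending and descending columns.

```r
library(dplyr)

# Made-up schedule: two days, a few departure times
sched <- tibble(
    day            = c(2, 1, 1, 2, 1),
    sched_dep_time = c(900, 1200, 600, 700, 800),
    flight         = c("A", "B", "C", "D", "E")
)

by_day_time <- sched %>% arrange(day, sched_dep_time)
by_day_time$flight        # "C" "E" "B" "D" "A"

# Earliest day first, but latest departure first within each day
by_day_late <- sched %>% arrange(day, desc(sched_dep_time))
by_day_late$flight        # "B" "E" "C" "A" "D"
```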

mutate()

In this dataset we have both the time the plane spent in the air (air_time) and distance traveled (distance). From these two pieces of information, we can figure out the average speed of the plane for the flight using mutate().

mutate() adds new columns to the end of the dataset, so let’s work with a smaller dataset for now so that we can see the values of our new column.

Stanford_small <- Stanford %>% 
    select(month, carrier, origin, dest, air_time, distance) %>%
    mutate(speed = distance / air_time * 60)
Stanford_small

## # A tibble: 13,972 x 7
##    month carrier origin dest  air_time distance speed
##    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>
##  1     1 UA      EWR    SFO        361     2565  426.
##  2     1 UA      JFK    SFO        366     2586  424.
##  3     1 DL      JFK    SFO        362     2586  429.
##  4     1 VX      JFK    SFO        356     2586  436.
##  5     1 B6      JFK    SFO        350     2586  443.
##  6     1 AA      JFK    SFO        378     2586  410.
##  7     1 UA      EWR    SFO        373     2565  413.
##  8     1 UA      JFK    SFO        369     2586  420.
##  9     1 UA      EWR    SFO        357     2565  431.
## 10     1 AA      JFK    SFO        389     2586  399.
## # ... with 13,962 more rows

mutate() can be used to create several new variables at once. For example, the following code is valid syntax:

Stanford_small %>% mutate(speed_miles_per_min = distance / air_time,
                   speed_miles_per_hour = speed_miles_per_min * 60)

## # A tibble: 13,972 x 9
##    month carrier origin dest  air_time distance speed speed_miles_per…
##    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>            <dbl>
##  1     1 UA      EWR    SFO        361     2565  426.             7.11
##  2     1 UA      JFK    SFO        366     2586  424.             7.07
##  3     1 DL      JFK    SFO        362     2586  429.             7.14
##  4     1 VX      JFK    SFO        356     2586  436.             7.26
##  5     1 B6      JFK    SFO        350     2586  443.             7.39
##  6     1 AA      JFK    SFO        378     2586  410.             6.84
##  7     1 UA      EWR    SFO        373     2565  413.             6.88
##  8     1 UA      JFK    SFO        369     2586  420.             7.01
##  9     1 UA      EWR    SFO        357     2565  431.             7.18
## 10     1 AA      JFK    SFO        389     2586  399.             6.65
## # ... with 13,962 more rows, and 1 more variable:
## #   speed_miles_per_hour <dbl>

If we only want to keep the newly created variables, use transmute() instead of mutate().
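Here is a small sketch contrasting the two. The trips data is made up, using the distance and air time values from the first two rows shown earlier. Notice that a column created inside mutate() can be used right away by a later column in the same call.

```r
library(dplyr)

trips <- tibble(
    distance = c(2565, 2586),   # miles
    air_time = c(361, 366)      # minutes
)

# mutate() keeps the original columns and adds the new ones at the end
with_speed <- trips %>%
    mutate(miles_per_min  = distance / air_time,
           miles_per_hour = miles_per_min * 60)
names(with_speed)

# transmute() keeps only the columns created in the call
only_speed <- trips %>%
    transmute(miles_per_hour = distance / air_time * 60)
names(only_speed)

round(only_speed$miles_per_hour)   # 426 424
```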

A digression: plotting our data

Let’s make use of our plotting skills from last session to see if there are any trends in air time. First, let’s make a histogram of air_time:

library(ggplot2)
ggplot(data = Stanford_small) + 
    geom_histogram(aes(x = air_time))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 162 rows containing non-finite values (stat_bin).

Did you notice the warning message about rows being removed for “containing non-finite values”? If you view the Stanford_small dataset and scroll all the way down, you’ll notice that there are some rows which have NA for air_time. Since we don’t know what the air time is, we can’t compute the speed and we can’t plot it.

NAs are something every data analyst should watch out for, as they could invalidate your analysis. Why are these data missing? Is it completely at random, or is there something going on? For this session, we will just leave them in the dataset.
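A quick first step with missing values is simply to count them. The sketch below uses base R on a made-up air_time vector; the same expressions work on a column of a dataset, e.g. inside summarize().

```r
air_time <- c(361, NA, 350, NA, 378)   # made-up values with two NAs

is.na(air_time)          # TRUE where the value is missing
sum(is.na(air_time))     # how many values are missing: 2
mean(is.na(air_time))    # what proportion is missing: 0.4
which(is.na(air_time))   # which positions are missing: 2 4
```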

It seems like the air time of planes might vary depending on the origin and destination, so let’s facet on these 2 variables:

ggplot(data = Stanford_small) + 
    geom_histogram(aes(x = air_time)) + 
    facet_grid(origin ~ dest)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 162 rows containing non-finite values (stat_bin).

We learn 3 things from this plot: (i) there are no flights from LaGuardia (LGA) to any of the 3 airports; (ii) there are no flights from Newark (EWR) to SJC/OAK; and (iii) there are very few flights from NYC to SJC/OAK compared to SFO. It’s hard to tell if there are differences in the distributions; the optional material section explores this question further.

summarize()

Instead of looking at plots, we can try to look at summary statistics instead. What was the mean/median air time for flights in our Stanford_small dataset? We can use the summarize() function to help us:

Stanford_small %>% summarize(mean_airtime = mean(air_time))

## # A tibble: 1 x 1
##   mean_airtime
##          <dbl>
## 1           NA

Stanford_small %>% summarize(median_airtime = median(air_time))

## # A tibble: 1 x 1
##   median_airtime
##            <dbl>
## 1             NA

The NAs are causing us trouble! We need to specify the na.rm = TRUE option to remove NAs from consideration:

Stanford_small %>% summarize(mean_airtime = mean(air_time, na.rm = TRUE))

## # A tibble: 1 x 1
##   mean_airtime
##          <dbl>
## 1         346.

Stanford_small %>% summarize(median_airtime = median(air_time, na.rm = TRUE))

## # A tibble: 1 x 1
##   median_airtime
##            <dbl>
## 1            345
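The reason for the NA results is that any arithmetic involving an unknown value is itself unknown. Here is a minimal base R sketch on a made-up vector, showing that na.rm = TRUE gives the same answer as dropping the NAs by hand:

```r
x <- c(340, NA, 350)       # made-up air times with one NA

mean(x)                    # NA: one unknown value makes the mean unknown
mean(x, na.rm = TRUE)      # 345: NAs are dropped before averaging
median(x, na.rm = TRUE)    # 345
mean(x[!is.na(x)])         # 345: same thing, dropping the NAs ourselves
```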

summarize() gives me a summary of the entire dataset. If I want summaries by group, then I have to use summarize() in conjunction with group_by(). group_by() changes the unit of analysis from the whole dataset to individual groups. The following code groups the dataset by carrier, then computes the summary statistic for each group:

Stanford_small %>%
    group_by(carrier) %>%
    summarize(mean_airtime = mean(air_time, na.rm = TRUE)) %>%
    arrange(desc(mean_airtime))

## # A tibble: 5 x 2
##   carrier mean_airtime
##   <chr>          <dbl>
## 1 AA              348.
## 2 VX              348.
## 3 DL              347.
## 4 B6              347.
## 5 UA              344.

I can also group by more than one variable. For example, if I wanted to count the number of flights for each carrier in each month, I could use the following code:

Stanford_small %>%
    group_by(month, carrier) %>%
    summarize(count = n())

## # A tibble: 60 x 3
## # Groups:   month [?]
##    month carrier count
##    <int> <chr>   <int>
##  1     1 AA        120
##  2     1 B6        121
##  3     1 DL        142
##  4     1 UA        422
##  5     1 VX        124
##  6     2 AA        108
##  7     2 B6        106
##  8     2 DL        127
##  9     2 UA        378
## 10     2 VX        104
## # ... with 50 more rows
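Counting rows per group is so common that dplyr has a shortcut for it, count(). The sketch below checks on a made-up dataset (toy is invented) that count() matches the group_by() + summarize() version. One detail: after summarizing over two grouping variables, the result is still grouped by the first one (that is what the "Groups: month" line in the output above is telling us), so I call ungroup() before comparing.

```r
library(dplyr)

# Made-up dataset: 5 flights over 2 months and 2 carriers
toy <- tibble(
    month   = c(1, 1, 1, 2, 2),
    carrier = c("UA", "UA", "AA", "UA", "AA")
)

# The long way: group, count rows per group, then drop the leftover grouping
by_hand <- toy %>%
    group_by(month, carrier) %>%
    summarize(n = n()) %>%
    ungroup()

# The shortcut: count() does the grouping and counting in one step
shortcut <- toy %>% count(month, carrier)

by_hand
shortcut
```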

We can even “pipe” the dataset to ggplot() to plot the data!

Stanford_small %>%
    group_by(month, carrier) %>%
    summarize(count = n()) %>%
    ggplot(mapping = aes(x = month, y = count, col = carrier)) +
        geom_line() +
        geom_point() +
        scale_x_continuous(breaks = 1:12)

Optional material

The %in% operator

Recall that we used the following line of code to extract flights that landed in SFO, SJC or OAK:

Stanford <- flights %>% filter(dest == "SFO" | dest == "SJC" | dest == "OAK")

We can use the %in% operator to make the code more succinct:

flights %>% filter(dest %in% c("SFO", "SJC", "OAK"))

## # A tibble: 13,972 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      558            600        -2      923
##  2  2013     1     1      611            600        11      945
##  3  2013     1     1      655            700        -5     1037
##  4  2013     1     1      729            730        -1     1049
##  5  2013     1     1      734            737        -3     1047
##  6  2013     1     1      745            745         0     1135
##  7  2013     1     1      746            746         0     1119
##  8  2013     1     1      803            800         3     1132
##  9  2013     1     1      826            817         9     1145
## 10  2013     1     1     1029           1030        -1     1427
## # ... with 13,962 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

The %in% operator is very useful, especially when we are checking if dest belongs to a long list of airports.
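Here is what %in% returns on its own, as a base R sketch with made-up destinations. One small difference from chaining == with |: a missing value gives FALSE with %in% but NA with ==. Inside filter() this makes no difference, since rows where the condition is NA are dropped either way.

```r
bay_area <- c("SFO", "SJC", "OAK")
dest     <- c("SFO", "LAX", "OAK", NA)   # made-up destinations, one missing

# One TRUE/FALSE per element of dest
in_bay_area <- dest %in% bay_area
in_bay_area    # TRUE FALSE TRUE FALSE

# The long version gives NA for the missing destination
long_version <- dest == "SFO" | dest == "SJC" | dest == "OAK"
long_version   # TRUE FALSE TRUE NA
```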

Joy plots

Let’s remove the rows with air_time being NA:

Stanford_small <- Stanford_small %>%
    filter(!is.na(air_time))

One theory we might have is that different carriers have different air times. Let’s do a facet on carrier:

ggplot(data = Stanford_small) + 
    geom_histogram(aes(x = air_time)) + 
    facet_grid(carrier ~ .)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The first thing we notice is that UA has many more flights than the other carriers. Because all 5 histograms have the same y-axis, this causes the other histograms to be obscured. To allow each histogram to have its own y-axis, we can add a scales argument to facet_grid():

ggplot(data = Stanford_small) + 
    geom_histogram(mapping = aes(x = air_time)) + 
    facet_grid(carrier ~ ., scales = "free_y")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As you can see, the histograms have very similar shapes, suggesting that the air times of the various carriers are roughly the same. The one thing that we might notice is the tails on the right.

A plot that is increasing in popularity for plotting multiple histograms or density plots is the joy plot (also called a ridgeline plot). The plot looks like a series of overlapping mountain ranges which can be compared against each other more easily than the histograms. The code below produces a joy plot:

library(ggridges)

## 
## Attaching package: 'ggridges'

## The following object is masked from 'package:ggplot2':
## 
##     scale_discrete_manual

ggplot(data = Stanford_small, aes(x = air_time, y = carrier)) +
    geom_density_ridges(scale = 5)

## Picking joint bandwidth of 3.24

(Play around with the scale parameter and see what happens.)

Session info

sessionInfo()

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggridges_0.5.0     ggplot2_3.0.0      bindrcpp_0.2.2    
## [4] nycflights13_1.0.0 dplyr_0.7.6        knitr_1.20        
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17     pillar_1.3.0     compiler_3.5.1   plyr_1.8.4      
##  [5] bindr_0.1.1      tools_3.5.1      digest_0.6.15    evaluate_0.10.1 
##  [9] tibble_1.4.2     gtable_0.2.0     pkgconfig_2.0.1  rlang_0.2.1     
## [13] cli_1.0.0        yaml_2.1.19      withr_2.1.2      stringr_1.3.1   
## [17] rprojroot_1.3-2  grid_3.5.1       tidyselect_0.2.4 glue_1.2.0      
## [21] R6_2.2.2         fansi_0.2.3      rmarkdown_1.10   purrr_0.2.5     
## [25] reshape2_1.4.3   magrittr_1.5     backports_1.1.2  scales_0.5.0    
## [29] htmltools_0.3.6  assertthat_0.2.0 colorspace_1.3-2 labeling_0.3    
## [33] utf8_1.1.4       stringi_1.2.3    lazyeval_0.2.1   munsell_0.5.0   
## [37] crayon_1.3.4

04-Data Transformation

Kenneth Tay

Oct 11, 2018
